Clustering metrics - alternatives to the elbow method¶
Dr. Tirthajyoti Sarkar, Fremont, CA 94536¶
Clustering is an important part of the machine learning pipeline for business and scientific enterprises that rely on data science. As the name suggests, it identifies congregations of closely related data points (by some measure of distance) in a blob of data that would otherwise be difficult to make sense of.
A popular method like k-means clustering does not seem to provide a completely satisfactory answer when we ask the basic question:
"How would we know the actual number of clusters, to begin with?"
This question is critically important because clustering is often a precursor to further processing of the individual cluster data, so the amount of computational resource required may depend on this measurement.
In the case of a business analytics problem, the repercussions can be worse. Clustering is often performed for such analytics with the goal of market segmentation. It is therefore easy to imagine that, depending on the number of clusters, appropriate marketing personnel will be allocated to the problem. Consequently, a wrong assessment of the number of clusters can lead to sub-optimal allocation of precious resources.
For the k-means clustering method, the most common approach to answering this question is the so-called elbow method. It involves running the algorithm in a loop with an increasing number of clusters, and then plotting a clustering score as a function of the number of clusters.
In this notebook, we show which metrics can visualize and determine an optimal number of clusters better than the usual practice, the elbow method.
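As a quick illustration before working with the real data, the elbow method can be sketched on synthetic blobs (a toy dataset via `make_blobs`, not the dataset analyzed below): inertia always decreases as k grows, and the "elbow" is the point where the drop flattens.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 well-separated clusters (illustrative only)
X_toy, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)

inertias = []
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_toy)
    inertias.append(km.inertia_)  # within-cluster sum of squared distances

# Inertia decreases monotonically with k; the "elbow" (here at k=3)
# is where adding clusters stops paying off.
```

The weakness of this approach, and the motivation for the alternative metrics below, is that "where the curve flattens" is a subjective judgment.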
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
n_features = 30
n_cluster = 3
# cluster_std = 1.2
# n_samples = 200
df1 = pd.read_pickle('PAT_3415_wref/PAT_3415_wref_2023-08-23.pkl')
y=df1['Class']
df1.drop('Class',inplace=True,axis=1)
df1
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GA1 | 0.483715 | 0.265764 | -0.119434 | -0.048157 | -0.179256 | -0.155573 | 1.187963 | 0.952131 | -0.086996 | -0.397896 | ... | -0.063387 | -0.262305 | -0.323652 | -0.308666 | -0.591139 | -0.389984 | 0.026109 | -0.020250 | 0.113928 | 0.026257 |
| GA2 | 0.398606 | 0.028651 | -0.203806 | -0.117415 | -0.136427 | -0.114856 | 0.588441 | 0.499296 | -0.328895 | -0.367832 | ... | -0.313107 | -0.438148 | -0.385843 | -0.261361 | -0.386349 | -0.354296 | -0.073311 | -0.142873 | -0.045190 | 0.021992 |
| GA3 | 0.043357 | 0.010885 | -0.104572 | -0.153744 | -0.182513 | -0.102815 | 0.523810 | 0.663739 | -0.045354 | -0.133836 | ... | 0.126345 | 0.809905 | 0.266149 | -0.036997 | -0.131652 | -0.203400 | 0.007803 | -0.021141 | 0.132494 | 0.046397 |
| GA4 | 0.410730 | 0.080677 | 0.036813 | 0.046381 | -0.096244 | -0.080055 | -0.080726 | -0.129069 | -0.340858 | -0.052157 | ... | 0.006090 | 0.984634 | 0.482303 | 0.260356 | -0.220176 | -0.091140 | -0.056507 | -0.014549 | 0.051525 | 0.017754 |
| GA5 | -0.143545 | 0.366892 | 0.107602 | -0.191617 | -0.166783 | -0.030808 | 1.012558 | 0.869969 | 0.159457 | -0.519511 | ... | -0.120698 | 0.458651 | 0.248589 | -0.063716 | -0.572379 | -0.618667 | -0.166705 | -0.059766 | -0.031125 | -0.044360 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| HLG14 | -0.817090 | -1.327095 | -0.533528 | -0.277863 | -0.369859 | -0.155220 | -3.180712 | -3.586907 | -2.039388 | -1.878885 | ... | -1.233885 | -1.501601 | -1.171114 | -1.202723 | -0.023732 | -0.120402 | 0.034683 | -0.434207 | 0.173213 | -0.045776 |
| HLG15 | -0.891595 | -0.966441 | -0.521640 | -0.413265 | -0.386647 | -0.248881 | -3.857659 | -3.783381 | -1.582642 | -1.449072 | ... | -1.079907 | -0.846336 | -0.370741 | -0.215898 | -0.523800 | -0.406407 | -0.049618 | -0.170514 | 0.229267 | 0.289204 |
| HLG16 | -0.986400 | -1.172226 | 0.090839 | -0.226294 | -0.134856 | -0.048451 | -3.553981 | -3.698886 | -1.380141 | -0.985199 | ... | -1.195962 | -0.639447 | -0.436414 | -0.202022 | -0.253749 | -0.075727 | 0.114317 | 0.139163 | 0.266820 | 0.302262 |
| HLG17 | -0.527283 | -0.458474 | -0.149502 | -0.201963 | -0.093792 | -0.110353 | -3.402531 | -3.229067 | -1.917119 | -1.246712 | ... | -1.344066 | -0.738237 | -0.382400 | -0.365429 | 0.145283 | 0.352366 | -0.048786 | 0.160611 | 0.382168 | 0.202780 |
| HLG18 | -0.909549 | -0.687828 | -0.136858 | -0.044270 | 0.050755 | 0.285514 | -2.379198 | -2.333656 | -1.630201 | -1.335235 | ... | -1.260918 | -0.444553 | -0.070491 | -0.239631 | 0.381152 | 0.182001 | -0.153096 | -0.140565 | 0.306024 | 0.199339 |
492 rows × 30 columns
from itertools import combinations
lst_vars=list(combinations(df1.columns,2))
len(lst_vars)
435
for k in range(1,15):
    plt.figure(figsize=(15,8))
    for i in range(1,30):
        plt.subplot(6,5,i)
        dim1=lst_vars[i-1][0]
        dim2=lst_vars[i-1][1]
        plt.scatter(df1[dim1][0:163],df1[dim2][0:163],c=df1[k][0:163],edgecolor='green',s=10)
        plt.scatter(df1[dim1][163:163*2],df1[dim2][163:163*2],c=df1[k][163:163*2],edgecolor='blue',s=10)
        plt.scatter(df1[dim1][163*2:163*3],df1[dim2][163*2:163*3],c=df1[k][163*2:163*3],edgecolor='red',s=10)
        plt.xlabel(f"{dim1}",fontsize=13)
        plt.ylabel(f"{dim2}",fontsize=13)
plt.figure(figsize=(15,8))
for i in range(1,30):
    plt.subplot(6,5,i)
    dim1=lst_vars[i-1][0]
    dim2=lst_vars[i-1][1]
    plt.scatter(df1[dim1],df1[dim2],c=df1[1],edgecolor='green',s=20)
    plt.xlabel(f"{dim1}",fontsize=13)
    plt.ylabel(f"{dim2}",fontsize=13)
How are the classes separated (boxplots)¶
plt.figure(figsize=(16,14))
for i,c in enumerate(df1.columns):
    plt.subplot(6,5,i+1)
    sns.boxplot(y=df1[c],x=y)  # x is the 'Class' labels saved before dropping the column
    plt.xticks(fontsize=15)
    plt.yticks(fontsize=15)
    plt.xlabel("Class",fontsize=15)
    plt.ylabel(c,fontsize=15)
#plt.show()
k-means clustering¶
from sklearn.cluster import KMeans
Unlabeled data¶
X=df1
X.tail()
| GA1 | GA2 | GA3 | GA4 | GA5 | GA6 | GA7 | GA8 | GB1 | GB2 | ... | HLG9 | HLG10 | HLG11 | HLG12 | HLG13 | HLG14 | HLG15 | HLG16 | HLG17 | HLG18 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 25 | -0.389984 | -0.354296 | -0.2034 | -0.09114 | -0.618667 | -0.6195 | -0.731027 | -0.834111 | -0.194137 | -0.187608 | ... | 0.603079 | 0.474329 | 0.196768 | 0.640251 | 0.33228 | -0.120402 | -0.406407 | -0.075727 | 0.352366 | 0.182001 |
| 26 | 0.026109 | -0.073311 | 0.007803 | -0.056507 | -0.166705 | -1.059621 | -0.757889 | -0.354309 | 0.040578 | 0.014403 | ... | -0.026981 | 0.204087 | 0.19126 | 0.611314 | 0.151943 | 0.034683 | -0.049618 | 0.114317 | -0.048786 | -0.153096 |
| 27 | -0.02025 | -0.142873 | -0.021141 | -0.014549 | -0.059766 | -0.217249 | -0.057531 | -0.174444 | 0.128981 | -0.025893 | ... | -0.035159 | 0.480328 | 0.474677 | 0.505634 | -0.373749 | -0.434207 | -0.170514 | 0.139163 | 0.160611 | -0.140565 |
| 28 | 0.113928 | -0.04519 | 0.132494 | 0.051525 | -0.031125 | -0.263776 | -0.076652 | -0.067927 | 0.025302 | 0.093717 | ... | 0.011437 | 0.550851 | 0.561869 | 0.739036 | 0.09719 | 0.173213 | 0.229267 | 0.26682 | 0.382168 | 0.306024 |
| 29 | 0.026257 | 0.021992 | 0.046397 | 0.017754 | -0.04436 | -0.057738 | -0.014697 | -0.052808 | 0.029368 | 0.026001 | ... | -0.268874 | 0.196157 | 0.240362 | 0.577535 | 0.306662 | -0.045776 | 0.289204 | 0.302262 | 0.20278 | 0.199339 |
5 rows × 492 columns
# y=df1['Class']
Scaling¶
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_scaled=scaler.fit_transform(X)
Metrics¶
from sklearn.metrics import silhouette_score, davies_bouldin_score,v_measure_score
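Before applying these metrics to the real data, a quick sanity check on synthetic blobs (a toy dataset, not the one analyzed here) shows how they behave: the silhouette score is higher for better clusterings, while the Davies-Bouldin score is lower, so fitting the true number of clusters should beat a deliberately wrong one on both.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, davies_bouldin_score

# Toy data with 3 well-separated clusters (illustrative only)
X_toy, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.7, random_state=42)

labels3 = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_toy)
labels8 = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(X_toy)

# Silhouette: higher is better (range -1 to 1)
s3 = silhouette_score(X_toy, labels3)
s8 = silhouette_score(X_toy, labels8)

# Davies-Bouldin: lower is better (0 is best)
d3 = davies_bouldin_score(X_toy, labels3)
d8 = davies_bouldin_score(X_toy, labels8)
```

The v-measure, in contrast, requires the ground-truth labels, so it is only available here because the `Class` column was saved in `y` before being dropped.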
Running k-means and computing inter-cluster distance score for various k values¶
km_scores= []
km_silhouette = []
vmeasure_score =[]
db_score = []
for i in range(2,12):  # k = 2..11, matching the plots below
    km = KMeans(n_clusters=i, random_state=0).fit(X_scaled)
    preds = km.predict(X_scaled)
    print("Score for number of cluster(s) {}: {}".format(i,km.score(X_scaled)))
    km_scores.append(-km.score(X_scaled))
    silhouette = silhouette_score(X_scaled,preds)
    km_silhouette.append(silhouette)
    print("Silhouette score for number of cluster(s) {}: {}".format(i,silhouette))
    db = davies_bouldin_score(X_scaled,preds)
    db_score.append(db)
    print("Davies Bouldin score for number of cluster(s) {}: {}".format(i,db))
    v_measure = v_measure_score(y,preds)
    vmeasure_score.append(v_measure)
    print("V-measure score for number of cluster(s) {}: {}".format(i,v_measure))
    print("-"*100)
plt.figure(figsize=(7,4))
plt.title("The elbow method for determining number of clusters\n",fontsize=16)
plt.scatter(x=[i for i in range(2,12)],y=km_scores,s=150,edgecolor='k')
plt.grid(True)
plt.xlabel("Number of clusters",fontsize=14)
plt.ylabel("K-means score",fontsize=15)
plt.xticks([i for i in range(2,12)],fontsize=14)
plt.yticks(fontsize=15)
plt.show()
plt.scatter(x=[i for i in range(2,12)],y=vmeasure_score,s=150,edgecolor='k')
plt.grid(True)
plt.xlabel("Number of clusters")
plt.ylabel("V-measure score")
plt.show()
plt.figure(figsize=(7,4))
plt.title("The silhouette coefficient method \nfor determining number of clusters\n",fontsize=16)
plt.scatter(x=[i for i in range(2,12)],y=km_silhouette,s=150,edgecolor='k')
plt.grid(True)
plt.xlabel("Number of clusters",fontsize=14)
plt.ylabel("Silhouette score",fontsize=15)
plt.xticks([i for i in range(2,12)],fontsize=14)
plt.yticks(fontsize=15)
plt.show()
plt.scatter(x=[i for i in range(2,12)],y=db_score,s=150,edgecolor='k')
plt.grid(True)
plt.xlabel("Number of clusters")
plt.ylabel("Davies-Bouldin score")
plt.show()
Expectation-maximization (Gaussian Mixture Model)¶
from sklearn.mixture import GaussianMixture
gm_bic= []
gm_score=[]
for i in range(2,12):
    gm = GaussianMixture(n_components=i,n_init=10,tol=1e-3,max_iter=1000).fit(X_scaled)
    print("BIC for number of cluster(s) {}: {}".format(i,gm.bic(X_scaled)))
    print("Log-likelihood score for number of cluster(s) {}: {}".format(i,gm.score(X_scaled)))
    print("-"*100)
    gm_bic.append(-gm.bic(X_scaled))
    gm_score.append(gm.score(X_scaled))
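As a sanity check for this model-selection idea (again on synthetic blobs, not the real data), the BIC criterion penalizes extra components, so it should bottom out near the true number of clusters rather than decreasing forever like the k-means score does.

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Toy data with 3 well-separated clusters (illustrative only)
X_toy, _ = make_blobs(n_samples=400, centers=3, cluster_std=0.6, random_state=1)

# Lower BIC is better; fit a range of component counts
bics = {k: GaussianMixture(n_components=k, n_init=5, random_state=0).fit(X_toy).bic(X_toy)
        for k in range(1, 7)}

best_k = min(bics, key=bics.get)  # component count with the lowest BIC
```

Because BIC trades off likelihood against model complexity, it gives a single objective minimum instead of requiring a subjective "elbow" judgment.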
plt.figure(figsize=(7,4))
plt.title("The Gaussian Mixture model BIC \nfor determining number of clusters\n",fontsize=16)
plt.scatter(x=[i for i in range(2,12)],y=np.log(gm_bic),s=150,edgecolor='k')
plt.grid(True)
plt.xlabel("Number of clusters",fontsize=14)
plt.ylabel("Log of Gaussian mixture BIC score",fontsize=15)
plt.xticks([i for i in range(2,12)],fontsize=14)
plt.yticks(fontsize=15)
plt.show()
plt.scatter(x=[i for i in range(2,12)],y=gm_score,s=150,edgecolor='k')
plt.show()